{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "environment": {
      "name": "tf2-gpu.2-1.m48",
      "type": "gcloud",
      "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m48"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.6"
    },
    "colab": {
      "name": "Multi-GPU Training on CAIP",
      "provenance": [],
      "collapsed_sections": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "N22NDjIjjxcI"
      },
      "source": [
        "# Distributed Training with GPUs on Cloud AI Platform\n",
        "\n",
        "**Learning Objectives:**\n",
        "  1. Setting up the environment\n",
        "  1. Create a model to train locally\n",
        "  1. Train on multiple GPUs/CPUs with MultiWorkerMirrored Strategy\n",
        "\n",
        "In this notebook, we will walk through using Cloud AI Platform to perform distributed training using the `MirroredStrategy` found within `tf.keras`. This strategy will allow us to use the synchronous AllReduce strategy on a VM with multiple GPUs attached.\n",
        "\n",
        "Each learning objective will correspond to a __#TODO__ in the [student lab notebook](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/production_ml/labs/distributed_training.ipynb) -- try to complete that notebook first before reviewing this solution notebook."

      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Nny3m465gKkY",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Use the chown command to change the ownership of repository to user\n",
        "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "63e-3SdLlh7K"
      },
      "source": [
        "Next we will configure our environment. Be sure to change the `PROJECT_ID` variable in the below cell to your Project ID. This will be the project to which the Cloud AI Platform resources will be billed. We will also create a bucket for our training artifacts (if it does not already exist)."
      ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Lab Task #1: Setting up the environment\n"
     ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TKZYJbBkBcOk",
        "outputId": "8dac5564-c864-4db8-9cc2-0c8036b75eb8",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 88
        }
      },
      "source": [
        "# The OS module in python provides functions for interacting with the operating system\n",
        "import os\n",
        "# TODO 1\n",
        "PROJECT_ID = \"cloud-training-demos\"  # Replace with your PROJECT\n",
        "BUCKET = PROJECT_ID \n",
        "REGION = 'us-central1'\n",
        "# Store the value of `BUCKET` and `PROJECT_ID` in environment variables.\n",
        "os.environ[\"PROJECT_ID\"] = PROJECT_ID\n",
        "os.environ[\"BUCKET\"] = BUCKET\n"
      ],
      "execution_count": null,
      "outputs": []
    },
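    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The training jobs in this notebook stage their code in `gs://$BUCKET`. As a minimal sketch, assuming the `gsutil` command-line tool is installed and authenticated in this environment, the bucket could be created only when it is missing. The helper name `ensure_bucket` is hypothetical and is not part of the lab.\n",
        "\n",
        "```python\n",
        "import subprocess\n",
        "\n",
        "def ensure_bucket(bucket, region):\n",
        "    # Hypothetical helper: create gs://<bucket> in <region> unless it can already be listed.\n",
        "    url = 'gs://{}'.format(bucket)\n",
        "    # `gsutil ls -b` exits with a non-zero status when the bucket cannot be listed,\n",
        "    # for example because it does not exist.\n",
        "    listed = subprocess.run(['gsutil', 'ls', '-b', url], capture_output=True)\n",
        "    if listed.returncode != 0:\n",
        "        # `gsutil mb -l` makes a bucket in the given location.\n",
        "        subprocess.run(['gsutil', 'mb', '-l', region, url], check=True)\n",
        "    return url\n",
        "\n",
        "# Example call, assuming BUCKET and REGION are defined as in the cell above:\n",
        "# ensure_bucket(BUCKET, REGION)\n",
        "```\n",
        "\n",
        "Checking before creating keeps the cell safe to rerun. Note that a failed listing can also mean missing permissions, in which case the create step will report the error."
      ]
    },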
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Jth9W8NtmNUD"
      },
      "source": [
        "Since we are going to submit our training job to Cloud AI Platform, we need to create our trainer package. We will create the `train` directory for our package and create a blank `__init__.py` file so Python knows that this folder contains a package."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cavM78bf_mU4"
      },
      "source": [
        "# Using `mkdir` we can create an empty directory\n",
        "!mkdir train\n",
        "# Using `touch` we can create an empty file\n",
        "!touch train/__init__.py"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PIMQo_lImhn_"
      },
      "source": [
        "Next we will create a module containing a function which will create our model. Note that we will be using the Fashion MNIST dataset. Since it's a small dataset, we will simply load it into memory for getting the parameters for our model.\n",
        "\n",
        "Our model will be a DNN with only dense layers, applying dropout to each hidden layer. We will also use ReLU activation for all hidden layers."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "88-9WeCQ_mU9",
        "outputId": "ae92afd1-93bd-49e5-aeda-6f6e177ac186",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "%%writefile train/model_definition.py\n",
        "# Here we'll import data processing libraries like Numpy and Tensorflow\n",
        "import tensorflow as tf\n",
        "import numpy as np\n",
        "\n",
        "# Get data\n",
        "\n",
        "(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()\n",
        "\n",
        "# add empty color dimension\n",
        "x_train = np.expand_dims(x_train, -1)\n",
        "x_test = np.expand_dims(x_test, -1)\n",
        "\n",
        "def create_model():\n",
        "# The `tf.keras.Sequential` method will sequential groups a linear stack of layers into a tf.keras.Model.\n",
        "    model = tf.keras.models.Sequential()\n",
        "# The `Flatten()` method will flattens the input and it does not affect the batch size.\n",
        "    model.add(tf.keras.layers.Flatten(input_shape=x_train.shape[1:]))\n",
        "# The `Dense()` method is just your regular densely-connected NN layer.\n",
        "    model.add(tf.keras.layers.Dense(1028))\n",
        "# The `Activation()` method applies an activation function to an output.\n",
        "    model.add(tf.keras.layers.Activation('relu'))\n",
        "# The `Dropout()` method applies dropout to the input.\n",
        "    model.add(tf.keras.layers.Dropout(0.5))\n",
        "# The `Dense()` method is just your regular densely-connected NN layer.\n",
        "    model.add(tf.keras.layers.Dense(512))\n",
        "# The `Activation()` method applies an activation function to an output.\n",
        "    model.add(tf.keras.layers.Activation('relu'))\n",
        "# The `Dropout()` method applies dropout to the input.\n",
        "    model.add(tf.keras.layers.Dropout(0.5))\n",
        "# The `Dense()` method is just your regular densely-connected NN layer.\n",
        "    model.add(tf.keras.layers.Dense(256))\n",
        "# The `Activation()` method applies an activation function to an output.\n",
        "    model.add(tf.keras.layers.Activation('relu'))\n",
        "# The `Dropout()` method applies dropout to the input.\n",
        "    model.add(tf.keras.layers.Dropout(0.5))\n",
        "# The `Dense()` method is just your regular densely-connected NN layer.\n",
        "    model.add(tf.keras.layers.Dense(10))\n",
        "# The `Activation()` method applies an activation function to an output.\n",
        "    model.add(tf.keras.layers.Activation('softmax'))\n",
        "    return model"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Writing train/model_definition.py\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0DOKsnDhnU87"
      },
      "source": [
        "Before we submit our training jobs to Cloud AI Platform, let's be sure our model runs locally. We will call the `model_definition` function to create our model and use `tf.keras.datasets.fashion_mnist.load_data()` to import the Fashion MNIST dataset."
      ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Lab Task #2: Create a model to train locally\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "r8bPcX7T-SH1",
        "outputId": "1069992b-0599-4b19-f7e8-b4cefa7cb6bb",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 215
        }
      },
      "source": [
        "# The OS module in python provides functions for interacting with the operating system\n",
        "import os\n",
        "# The Python time module provides many ways of representing time in code, such as objects, numbers, and strings.\n",
        "# It also provides functionality other than representing time, like waiting during code execution and measuring the efficiency of your code.\n",
        "import time\n",
        "# Here we'll import data processing libraries like Numpy and Tensorflow\n",
        "import tensorflow as tf\n",
        "import numpy as np\n",
        "from train import model_definition\n",
        "\n",
        "#Get data\n",
        "# TODO 2\n",
        "\n",
        "(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()\n",
        "\n",
        "# add empty color dimension\n",
        "x_train = np.expand_dims(x_train, -1)\n",
        "x_test = np.expand_dims(x_test, -1)\n",
        "\n",
        "def create_dataset(X, Y, epochs, batch_size):\n",
        "    dataset = tf.data.Dataset.from_tensor_slices((X, Y))\n",
        "    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)\n",
        "    return dataset\n",
        "\n",
        "ds_train = create_dataset(x_train, y_train, 20, 5000)\n",
        "ds_test = create_dataset(x_test, y_test, 1, 1000)\n",
        "\n",
        "model = model_definition.create_model()\n",
        "\n",
        "model.compile(\n",
        "# Using `tf.keras.optimizers.Adam` the optimizer will implements the Adam algorithm.\n",
        "  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, ),\n",
        "  loss='sparse_categorical_crossentropy',\n",
        "  metrics=['sparse_categorical_accuracy'])\n",
        "    \n",
        "start = time.time()\n",
        "\n",
        "model.fit(\n",
        "    ds_train,\n",
        "    validation_data=ds_test, \n",
        "    verbose=1\n",
        ")\n",
        "print(\"Training time without GPUs locally: {}\".format(time.time() - start))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz\n",
            "32768/29515 [=================================] - 0s 0us/step\n",
            "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz\n",
            "26427392/26421880 [==============================] - 0s 0us/step\n",
            "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz\n",
            "8192/5148 [===============================================] - 0s 0us/step\n",
            "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz\n",
            "4423680/4422102 [==============================] - 0s 0us/step\n",
            "240/240 [==============================] - 174s 725ms/step - loss: 4.1184 - sparse_categorical_accuracy: 0.6367 - val_loss: 0.6234 - val_sparse_categorical_accuracy: 0.7880\n",
            "Training time without GPUs locally: 175.62197422981262\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "L_U-u_tZ_mVK"
      },
      "source": [
        "\n",
        "\n",
        "## Train on multiple GPUs/CPUs with MultiWorkerMirrored Strategy\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P0VQ7GUsn8wb"
      },
      "source": [
        "That took a few minutes to train our model for 20 epochs. Let's see how we can do better using Cloud AI Platform. We will be leveraging the `MultiWorkerMirroredStrategy` supplied in `tf.distribute`. The main difference between this code and the code from the local test is that we need to compile the model within the scope of the strategy. When we do this our training op will use information stored in the `TF_CONFIG` variable to assign ops to the various devices for the AllReduce strategy. \n",
        "\n",
        "After the training process finishes, we will print out the time spent training. Since it takes a few minutes to spin up the resources being used for training on Cloud AI Platform, and this time can vary, we want a consistent measure of how long training took.\n",
        "\n",
        "Note: When we train models on Cloud AI Platform, the `TF_CONFIG` variable is automatically set. So we do not need to worry about adjusting based on what cluster configuration we use."
      ]
    },
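    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the role of `TF_CONFIG` more concrete, here is a minimal sketch of how its contents could be inspected. `TF_CONFIG` holds a JSON string with a `cluster` entry (the addresses for each role) and a `task` entry (the role and index of the current process). The helper name `describe_tf_config` and the host names in the example value are hypothetical; the value that Cloud AI Platform actually sets will differ.\n",
        "\n",
        "```python\n",
        "import json\n",
        "import os\n",
        "\n",
        "def describe_tf_config(raw=None):\n",
        "    # TF_CONFIG holds a JSON string; fall back to an empty object when it is unset.\n",
        "    if raw is None:\n",
        "        raw = os.environ.get('TF_CONFIG', '{}')\n",
        "    tf_config = json.loads(raw)\n",
        "    cluster = tf_config.get('cluster', {})\n",
        "    task = tf_config.get('task', {})\n",
        "    # Count the addresses listed for each role in the cluster.\n",
        "    sizes = {role: len(addresses) for role, addresses in cluster.items()}\n",
        "    return sizes, task.get('type'), task.get('index')\n",
        "\n",
        "# Illustrative value only (made-up hosts); the platform sets the real one.\n",
        "example = json.dumps({\n",
        "    'cluster': {'worker': ['host1:2222', 'host2:2222']},\n",
        "    'task': {'type': 'worker', 'index': 0},\n",
        "})\n",
        "print(describe_tf_config(example))\n",
        "```\n",
        "\n",
        "Calling `describe_tf_config()` with no argument inside a training job would read the variable that the platform set, which can be handy when checking the logs of a job."
      ]
    },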
    {
      "cell_type": "code",
      "metadata": {
        "id": "_AF4VDhT_mVg",
        "outputId": "e2e1a496-a369-4f62-856a-b2fe88178add",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "%%writefile train/train_mult_worker_mirrored.py\n",
        "# The OS module in python provides functions for interacting with the operating system\n",
        "import os\n",
        "# The Python time module provides many ways of representing time in code, such as objects, numbers, and strings.\n",
        "# It also provides functionality other than representing time, like waiting during code execution and measuring the efficiency of your code.\n",
        "import time\n",
        "# Here we'll import data processing libraries like Numpy and Tensorflow\n",
        "import tensorflow as tf\n",
        "import numpy as np\n",
        "from . import model_definition\n",
        "\n",
        "# The `MultiWorkerMirroredStrategy()` method will work as a distribution strategy for synchronous training on multiple workers.\n",
        "strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()\n",
        "\n",
        "#Get data\n",
        "\n",
        "(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()\n",
        "\n",
        "# add empty color dimension\n",
        "x_train = np.expand_dims(x_train, -1)\n",
        "x_test = np.expand_dims(x_test, -1)\n",
        "\n",
        "def create_dataset(X, Y, epochs, batch_size):\n",
        "    dataset = tf.data.Dataset.from_tensor_slices((X, Y))\n",
        "    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)\n",
        "    return dataset\n",
        "\n",
        "ds_train = create_dataset(x_train, y_train, 20, 5000)\n",
        "ds_test = create_dataset(x_test, y_test, 1, 1000)\n",
        "\n",
        "print('Number of devices: {}'.format(strategy.num_replicas_in_sync))\n",
        "\n",
        "with strategy.scope():\n",
        "    model = model_definition.create_model()\n",
        "    model.compile(\n",
        "      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, ),\n",
        "      loss='sparse_categorical_crossentropy',\n",
        "      metrics=['sparse_categorical_accuracy'])\n",
        "    \n",
        "start = time.time()\n",
        "\n",
        "model.fit(\n",
        "    ds_train,\n",
        "    validation_data=ds_test, \n",
        "    verbose=2\n",
        ")\n",
        "print(\"Training time with multiple GPUs: {}\".format(time.time() - start))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Writing train/train_mult_worker_mirrored.py\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Lab Task #3: Training with multiple GPUs/CPUs on created model using MultiWorkerMirrored Strategy\n"
     ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ViUZcz7Tp9Kp"
      },
      "source": [
        "First we will train a model without using GPUs to give us a baseline. We will use a consistent format throughout the trials. We will define a `config.yaml` file to contain our cluster configuration and the pass this file in as the value of a command-line argument `--config`.\n",
        "\n",
        "In our first example, we will use a single `n1-highcpu-16` VM."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sBJw5hJAadVW",
        "outputId": "8f377e27-e1c3-4a44-e5ef-21c37a7f64fd",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "%%writefile config.yaml\n",
        "# TODO 3a\n",
        "# Configure a master worker\n",
        "trainingInput:\n",
        "  scaleTier: CUSTOM\n",
        "  masterType: n1-highcpu-16"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Writing config.yaml\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_mlylgvCaeXW",
        "outputId": "383eb016-e791-4cfc-f382-72acd47932b4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 195
        }
      },
      "source": [
        "%%bash\n",
        "\n",
        "now=$(date +\"%Y%m%d_%H%M%S\")\n",
        "JOB_NAME=\"cpu_only_fashion_minst_$now\"\n",
        "\n",
        "gcloud ai-platform jobs submit training $JOB_NAME \\\n",
        "  --staging-bucket=gs://$BUCKET \\\n",
        "  --package-path=train \\\n",
        "  --module-name=train.train_mult_worker_mirrored \\\n",
        "  --runtime-version=2.3 \\\n",
        "  --python-version=3.7 \\\n",
        "  --region=us-west1 \\\n",
        "  --config config.yaml"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "jobId: cpu_only_fashion_minst_20200903_154222\n",
            "state: QUEUED\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "Job [cpu_only_fashion_minst_20200903_154222] submitted successfully.\n",
            "Your job is still active. You may view the status of your job with the command\n",
            "\n",
            "  $ gcloud ai-platform jobs describe cpu_only_fashion_minst_20200903_154222\n",
            "\n",
            "or continue streaming the logs with the command\n",
            "\n",
            "  $ gcloud ai-platform jobs stream-logs cpu_only_fashion_minst_20200903_154222\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g9tji3XRqbi-"
      },
      "source": [
        "If we go through the logs, we see that the training job will take around 5-7 minutes to complete. Let's now attach two Nvidia Tesla K80 GPUs and rerun the training job."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fCGWARBH_mVi",
        "outputId": "64f42735-7356-40c4-8460-afc096a9896c",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "%%writefile config.yaml\n",
        "# TODO 3b\n",
        "# Configure a master worker\n",
        "trainingInput:\n",
        "  scaleTier: CUSTOM\n",
        "  masterType: n1-highcpu-16\n",
        "  masterConfig:\n",
        "    acceleratorConfig:\n",
        "      count: 2\n",
        "      type: NVIDIA_TESLA_K80"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Overwriting config.yaml\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UmXeeg6r_mVk",
        "outputId": "0017abd4-077d-4089-f664-d15dac69f755",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 195
        }
      },
      "source": [
        "%%bash\n",
        "\n",
        "now=$(date +\"%Y%m%d_%H%M%S\")\n",
        "JOB_NAME=\"multi_gpu_fashion_minst_2gpu_$now\"\n",
        "\n",
        "gcloud ai-platform jobs submit training $JOB_NAME \\\n",
        "  --staging-bucket=gs://$BUCKET \\\n",
        "  --package-path=train \\\n",
        "  --module-name=train.train_mult_worker_mirrored \\\n",
        "  --runtime-version=2.3 \\\n",
        "  --python-version=3.7 \\\n",
        "  --region=us-west1 \\\n",
        "  --config config.yaml"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "jobId: multi_gpu_fashion_minst_2gpu_20200903_154225\n",
            "state: QUEUED\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "Job [multi_gpu_fashion_minst_2gpu_20200903_154225] submitted successfully.\n",
            "Your job is still active. You may view the status of your job with the command\n",
            "\n",
            "  $ gcloud ai-platform jobs describe multi_gpu_fashion_minst_2gpu_20200903_154225\n",
            "\n",
            "or continue streaming the logs with the command\n",
            "\n",
            "  $ gcloud ai-platform jobs stream-logs multi_gpu_fashion_minst_2gpu_20200903_154225\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0fMLxLOgq7mc"
      },
      "source": [
        "That was a lot faster! The training job will take upto 5-10 minutes to complete. Let's keep going and add more GPUs!"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ocXYk61hGRG_",
        "outputId": "ff6d1dc4-ab02-4ab7-acea-2420c5c1d5aa",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "%%writefile config.yaml\n",
        "# TODO 3c\n",
        "# Configure a master worker\n",
        "trainingInput:\n",
        "  scaleTier: CUSTOM\n",
        "  masterType: n1-highcpu-16\n",
        "  masterConfig:\n",
        "    acceleratorConfig:\n",
        "      count: 4\n",
        "      type: NVIDIA_TESLA_K80"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Overwriting config.yaml\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MKRRVDLhZoj3",
        "outputId": "4012bb48-d7f9-4e6b-d034-daa258cc636f",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 195
        }
      },
      "source": [
        "%%bash\n",
        "\n",
        "now=$(date +\"%Y%m%d_%H%M%S\")\n",
        "JOB_NAME=\"multi_gpu_fashion_minst_4gpu_$now\"\n",
        "\n",
        "gcloud ai-platform jobs submit training $JOB_NAME \\\n",
        "  --staging-bucket=gs://$BUCKET \\\n",
        "  --package-path=train \\\n",
        "  --module-name=train.train_mult_worker_mirrored \\\n",
        "  --runtime-version=2.3 \\\n",
        "  --python-version=3.7 \\\n",
        "  --region=us-west1 \\\n",
        "  --config config.yaml"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "jobId: multi_gpu_fashion_minst_4gpu_20200903_154228\n",
            "state: QUEUED\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "Job [multi_gpu_fashion_minst_4gpu_20200903_154228] submitted successfully.\n",
            "Your job is still active. You may view the status of your job with the command\n",
            "\n",
            "  $ gcloud ai-platform jobs describe multi_gpu_fashion_minst_4gpu_20200903_154228\n",
            "\n",
            "or continue streaming the logs with the command\n",
            "\n",
            "  $ gcloud ai-platform jobs stream-logs multi_gpu_fashion_minst_4gpu_20200903_154228\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F3PfreD5rfgE"
      },
      "source": [
        "The training job will take upto 10 minutes to complete. It was faster than no GPUs, but why was it slower than 2 GPUs? If you rerun this job with 8 GPUs you'll actually see it takes just as long as using no GPUs!\n",
        "\n",
        "The answer is in our input pipeline. In short, the I/O involved in using more GPUs started to outweigh the benefits of having more available devices. We can try to improve our input pipelines to overcome this (e.g. using caching, adjusting batch size, etc.). \n"
      ]
    },
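    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As one possible starting point, here is a sketch of a variant of `create_dataset` that adds caching and prefetching with the `tf.data` API. It is an illustration rather than a tuned solution: whether these changes help for this small in-memory dataset, and which batch size works best for a given number of GPUs, would need to be measured by rerunning the jobs.\n",
        "\n",
        "```python\n",
        "import tensorflow as tf\n",
        "\n",
        "def create_dataset(X, Y, epochs, batch_size):\n",
        "    dataset = tf.data.Dataset.from_tensor_slices((X, Y))\n",
        "    # Cache the elements after the first pass so that later epochs reuse them.\n",
        "    dataset = dataset.cache()\n",
        "    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)\n",
        "    # Prepare upcoming batches while the current one is being processed.\n",
        "    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)\n",
        "    return dataset\n",
        "```\n",
        "\n",
        "The function keeps the same signature as the original, so it could be swapped into the training module without changing the calls that build `ds_train` and `ds_test`."
      ]
    },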
    {
      "cell_type": "code",
      "metadata": {
        "id": "Z8pPmHAbn4CB"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}
